database search
pUniFind: a unified large pre-trained deep learning model pushing the limit of mass spectra interpretation
Zhao, Jiale, Mao, Pengzhi, Wang, Kaifei, Li, Yiming, Peng, Yaping, Chen, Ranfei, Lu, Shuqi, Ji, Xiaohong, Ding, Jiaxiang, Zhang, Xin, Liao, Yucheng, E, Weinan, Zhang, Weijie, Wen, Han, Chi, Hao
Deep learning has advanced mass spectrometry data interpretation, yet most models remain feature extractors rather than unified scoring frameworks. We present pUniFind, the first large-scale multimodal pre-trained model in proteomics that integrates end-to-end peptide-spectrum scoring with open, zero-shot de novo sequencing. Trained on over 100 million open-search-derived spectra, pUniFind aligns spectral and peptide modalities via cross-modality prediction and outperforms traditional engines across diverse datasets, most notably achieving a 42.6 percent increase in the number of identified peptides in immunopeptidomics. Supporting over 1,300 modifications, pUniFind identifies 60 percent more PSMs than existing de novo methods despite a 300-fold larger search space. A deep-learning-based quality control module further recovers 38.5 percent additional peptides, including 1,891 mapped to the genome but absent from reference proteomes, while preserving full fragment ion coverage. These results establish a unified, scalable deep learning framework for proteomic analysis, offering improved sensitivity, modification coverage, and interpretability.
Foundation model for mass spectrometry proteomics
Sanders, Justin, Yilmaz, Melih, Russell, Jacob H., Bittremieux, Wout, Fondrie, William E., Riley, Nicholas M., Oh, Sewoong, Noble, William Stafford
Mass spectrometry is the dominant technology in the field of proteomics, enabling high-throughput analysis of the protein content of complex biological samples. Due to the complexity of the instrumentation and resulting data, sophisticated computational methods are required for the processing and interpretation of acquired mass spectra. Machine learning has shown great promise to improve the analysis of mass spectrometry data, with numerous purpose-built methods for improving specific steps in the data acquisition and analysis pipeline reaching widespread adoption. Here, we propose unifying various spectrum prediction tasks under a single foundation model for mass spectra. To this end, we pre-train a spectrum encoder using de novo sequencing as a pre-training task. We then show that using these pre-trained spectrum representations improves our performance on the four downstream tasks of spectrum quality prediction, chimericity prediction, phosphorylation prediction, and glycosylation status prediction. Finally, we perform multi-task fine-tuning and find that this approach improves the performance on each task individually. Overall, our work demonstrates that a foundation model for tandem mass spectrometry proteomics trained on de novo sequencing learns generalizable representations of spectra, improves performance on downstream tasks where training data is limited, and can ultimately enhance data acquisition and analysis in proteomics experiments.
Disentangling the Complex Multiplexed DIA Spectra in De Novo Peptide Sequencing
Ma, Zheng, Mao, Zeping, Zhang, Ruixue, Chen, Jiazhen, Xin, Lei, Shan, Paul, Ghodsi, Ali, Li, Ming
Data-Independent Acquisition (DIA) was introduced to improve sensitivity by covering all peptides in a range rather than sampling only high-intensity peaks, as in Data-Dependent Acquisition (DDA) mass spectrometry. However, it is unclear how useful DIA data are for de novo peptide sequencing, because DIA data are marred by coeluted peptides, high noise, and varying data quality. We present DIANovo, a new deep learning method that addresses each of these difficulties and improves on the previously established system DeepNovo-DIA by 25% to 81% (averaging 48%) for amino acid recall and by 27% to 89% (averaging 57%) for peptide recall, by equipping the model with a deeper understanding of coeluted DIA spectra. This paper also provides criteria for when DIA data can and cannot be used for de novo peptide sequencing, based on a comparison between DDA and DIA in both de novo and database search modes. We find that while DIA excels with narrow isolation windows on older-generation instruments, it loses its advantage with wider windows. However, with the Orbitrap Astral, DIA consistently outperforms DDA because narrow-window mode is enabled. We also provide a theoretical explanation of this phenomenon, emphasizing the critical role of the signal-to-noise profile in the successful application of de novo sequencing.
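The amino acid recall and peptide recall figures quoted in this abstract can be illustrated with a minimal sketch. This is not the paper's evaluation code: the function names `amino_acid_matches` and `recall_metrics` are hypothetical, and the position-by-position residue comparison is a simplifying assumption (published de novo benchmarks typically match residues by mass within a tolerance rather than by string position).

```python
from typing import Sequence, Tuple


def amino_acid_matches(predicted: str, truth: str) -> int:
    """Count positions where the predicted and true residues agree.

    Simplifying assumption: residues are compared position by position
    as plain characters, not by mass within a tolerance.
    """
    return sum(p == t for p, t in zip(predicted, truth))


def recall_metrics(
    predictions: Sequence[str], truths: Sequence[str]
) -> Tuple[float, float]:
    """Return (amino_acid_recall, peptide_recall) over paired sequences.

    amino_acid_recall: matched residues / total residues in the ground truth.
    peptide_recall: fraction of peptides predicted exactly right.
    """
    if len(predictions) != len(truths) or not truths:
        raise ValueError("need equally sized, non-empty prediction and truth lists")

    total_residues = sum(len(t) for t in truths)
    matched_residues = sum(
        amino_acid_matches(p, t) for p, t in zip(predictions, truths)
    )
    exact_peptides = sum(p == t for p, t in zip(predictions, truths))

    return matched_residues / total_residues, exact_peptides / len(truths)


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    preds = ["PEPTIDE", "PEPTLDK"]
    truth = ["PEPTIDE", "PEPTIDK"]
    aa_recall, pep_recall = recall_metrics(preds, truth)
    print(f"amino acid recall: {aa_recall:.3f}, peptide recall: {pep_recall:.3f}")
```

Peptide recall is the stricter of the two, since a single wrong residue makes the whole peptide count as a miss while barely lowering amino acid recall.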
5 Models for Conversational AI
How can chatbots become truly intelligent by combining five different models of conversation? Conversational AI is all about making machines communicate with us in natural language. These systems go by various names: chatbots, voice bots, virtual assistants, and so on. In practice, they may differ slightly from one another. However, one key feature ties them all together: the ability to understand natural language commands and requests from us, the human users. On the back end, these agents have to carry out the request and engage in a conversation.